Meta-Learning for Phonemic Annotation of Corpora
Abstract
We apply rule induction, classifier combination and meta-learning (stacked classifiers) to the problem of bootstrapping high-accuracy automatic annotation of corpora with pronunciation information. The task we address in this paper consists of generating phonemic representations reflecting the Flemish and Dutch pronunciations of a word on the basis of its orthographic representation (which in turn is based on the actual speech recordings). We compare several possible approaches to the text-to-pronunciation mapping task: memory-based learning, transformation-based learning, rule induction, maximum entropy modeling, combination of classifiers in stacked learning, and stacking of meta-learners. We are interested both in optimal accuracy and in obtaining insight into the linguistic regularities involved. As far as accuracy is concerned, an already high word-level accuracy for single classifiers (93% for Celex and 86% for Fonilex) is boosted significantly: combination of classifiers yields additional error reductions of 31% and 38% respectively, and combination of meta-learners a further 5%, bringing overall word-level accuracy to 96% for the Dutch variant and 92% for the Flemish variant. We also show that the application of machine learning methods indeed leads to increased insight into the linguistic regularities determining the variation between the two pronunciation variants studied.
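The abstract names stacked classifiers but does not describe how they are built. As a rough illustration only, the following is a minimal sketch of stacking under simplifying assumptions: the base classifiers are majority-vote lookup tables over a three-letter window, and the meta-learner is another lookup table trained on their held-out predictions. The data, the labels, and the names `train_lookup`, `train_stacked` and `BASE_KEYS` are hypothetical; none of them come from the paper, Celex or Fonilex.

```python
from collections import Counter, defaultdict


def train_lookup(instances, key_fn):
    """Train a majority-vote lookup classifier keyed by key_fn(features)."""
    table = defaultdict(Counter)
    overall = Counter()
    for features, label in instances:
        table[key_fn(features)][label] += 1
        overall[label] += 1
    default = overall.most_common(1)[0][0]  # fallback for unseen keys

    def predict(features):
        counts = table.get(key_fn(features))
        return counts.most_common(1)[0][0] if counts else default

    return predict


# Hypothetical base classifiers; features are (left, focus, right) letters.
BASE_KEYS = [
    lambda f: f[1],          # focus letter only
    lambda f: (f[0], f[1]),  # left context + focus letter
    lambda f: (f[1], f[2]),  # focus letter + right context
]


def train_stacked(instances, n_folds=2):
    """Stacking: the meta-learner is trained on held-out base predictions."""
    meta_instances = []
    for fold in range(n_folds):
        held_out = instances[fold::n_folds]
        rest = [x for i, x in enumerate(instances) if i % n_folds != fold]
        bases = [train_lookup(rest, key) for key in BASE_KEYS]
        for features, label in held_out:
            meta_features = tuple(b(features) for b in bases)
            meta_instances.append((meta_features, label))
    meta = train_lookup(meta_instances, lambda f: f)
    # Retrain the base classifiers on all data for prediction time.
    bases = [train_lookup(instances, key) for key in BASE_KEYS]

    def predict(features):
        return meta(tuple(b(features) for b in bases))

    return predict


# Toy data with placeholder labels: "c" maps to "k" before "a", "s" before "e".
DATA = [
    (("_", "c", "a"), "k"),
    (("a", "c", "a"), "k"),
    (("_", "c", "e"), "s"),
    (("a", "c", "e"), "s"),
    (("e", "c", "a"), "k"),
    (("_", "c", "a"), "k"),
    (("e", "c", "e"), "s"),
    (("_", "c", "e"), "s"),
]

stacked = train_stacked(DATA)
print(stacked(("o", "c", "e")))  # s
print(stacked(("o", "c", "a")))  # k
```

On this toy data, a plain majority vote over the three base classifiers would output "k" for the first query, because two of them ignore the right context. The meta-learner instead learns from held-out predictions which base output to trust, which is the general idea behind stacking.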
Similar resources
Automatic Classification by Topic Domain for Meta Data Generation, Web Corpus Evaluation, and Corpus Comparison
In this paper, we describe preliminary results from an ongoing experiment wherein we classify two large unstructured text corpora—a web corpus and a newspaper corpus—by topic domain (or subject area). Our primary goal is to develop a method that allows for the reliable annotation of large crawled web corpora with meta data required by many corpus linguists. We are especially interested in desig...
The Effect of Transcribing on Beginning Learners’ Phonemic Perception
A large number of studies dealing with phonology have focused their attention on phonological production at the expense of phonological perception which provides the foundation stone for phonological production. This study focuses on phonological perception at phonemic level. The purpose of the study is helping beginning learners improve their perception of the English phonemes which are confus...
Meta-Learning with Selective Data Augmentation for Medical Entity Recognition
With the increasing number of annotated corpora for supervised Named Entity Recognition, it becomes interesting to study the combination and augmentation of these corpora for the same annotation task. In this paper, we particularly study the combination of heterogeneous corpora for Medical Entity Recognition by using a meta-learning classifier that combines the results of individual Conditional...
The Effects of Multimedia Annotations on Iranian EFL Learners’ L2 Vocabulary Learning
In our modern technological world, Computer-Assisted Language learning (CALL) is a new realm towards learning a language in general, and learning L2 vocabulary in particular. It is assumed that the use of multimedia annotations promotes language learners’ vocabulary acquisition. Therefore, this study set out to investigate the effects of different multimedia annotations (still picture annotatio...
Meta-Knowledge Annotation at the Event Level: Comparison between Abstracts and Full Papers
Biomedical literature contains rich information about events of biological relevance. Event corpora, containing classified, structured representations of important facts and findings contained within text, provide an important resource for the training of domain-specific information extraction (IE) systems. Such corpora pay little attention to the interpretation of events, e.g., whether an even...